Tutorial 🐧
This is a gentle and lighthearted tutorial on how to use the tools from SplitApplyPlot, using as an example dataset a collection of measurements on penguins[1]. See the Palmer penguins website for more information.
using CSV, DataFrames, HTTP
url = "https://cdn.jsdelivr.net/gh/allisonhorst/palmerpenguins@433439c8b013eff3d36c847bb7a27fa0d7e353d8/inst/extdata/penguins.csv"
penguins = dropmissing(CSV.read(HTTP.get(url).body, DataFrame, missingstring="NA"))
first(penguins, 6)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| | String | String | Float64 | Float64 | Int64 | Int64 | String |
| 1 | Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male |
| 2 | Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female |
| 3 | Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male |
| 6 | Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female |
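The loading step above combines two things: `missingstring = "NA"` tells CSV to read the text NA as `missing`, and `dropmissing` then removes every row that contains a `missing` entry. A minimal sketch of that behavior on a hypothetical three-row CSV (made up for illustration, not the penguin file) could look like this:

```julia
using CSV, DataFrames

# Hypothetical three-row CSV, for illustration only (not the penguin data).
csv_text = """
species,body_mass_g
Adelie,3750
Adelie,NA
Adelie,3800
"""

# `missingstring = "NA"` turns the literal text NA into `missing`.
raw = CSV.read(IOBuffer(csv_text), DataFrame, missingstring = "NA")

# `dropmissing` keeps only the rows without any `missing` entry.
clean = dropmissing(raw)

println(nrow(raw), " rows before, ", nrow(clean), " rows after")
```

Comparing `nrow` before and after on the real dataset is a quick way to see how many penguins were discarded because of incomplete measurements.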
Frequency plots
Let us start by getting a rough idea of how the data is distributed.
using SplitApplyPlot, CairoMakie
specs = data(penguins) * frequency() * mapping(:species)
draw(specs)
Next, let us see whether the distribution is the same across islands.
specs * mapping(color = :island) |> draw
Oops! The bars are in the same spot and are hiding each other. We need to specify how we want to fix this. Bars can either dodge each other or be stacked on top of each other.
specs * mapping(color = :island, dodge = :island) |> draw
This is our first finding: Adelie is the only species of penguin found on all three islands. To see in a single plot both which species is most numerous and how each species is distributed across islands, we could have used stack instead.
specs * mapping(color = :island, stack = :island) |> draw
Correlating two variables
Now that we have understood the distribution of these three penguin species, we can start analyzing their features.
specs = data(penguins) * mapping(:bill_length_mm, :bill_depth_mm)
draw(specs)
We would actually prefer to visualize these measures in centimeters, and to have cleaner axis labels. As we want this setting to be preserved in all of our bill visualizations, let us save it in the variable specs.
specs = data(penguins) * mapping(
:bill_length_mm => (t -> t / 10) => "bill length (cm)",
:bill_depth_mm => (t -> t / 10) => "bill depth (cm)",
)
draw(specs)
Much better! Note the parentheses around the function t -> t / 10. They are necessary to specify that the function maps t to t / 10, and not to t / 10 => "bill length (cm)".
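The effect of the parentheses can be checked in plain Julia, without any plotting. The sketch below builds the pair both ways and inspects the pieces; the variable names `unparenthesized` and `parenthesized` and the test value 391 are made up for illustration.

```julia
# Without parentheses, everything to the right of `->` becomes the function body.
unparenthesized = :bill_length_mm => t -> t / 10 => "bill length (cm)"

# With parentheses, the function ends at `t / 10` and the label is a separate element.
parenthesized = :bill_length_mm => (t -> t / 10) => "bill length (cm)"

# Unparenthesized: column => function, and the function itself returns a Pair.
println(unparenthesized.second isa Function)
println(unparenthesized.second(391) isa Pair)

# Parenthesized: column => (function => label), and the function returns a number.
println(parenthesized.second isa Pair)
println(parenthesized.second.first(391) isa Number)
println(parenthesized.second.second)
```

Only the parenthesized form carries the label as a separate piece, which is the form the mapping above relies on.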
There does not seem to be a strong correlation between the two dimensions, which is odd. Maybe dividing the data by species will help.
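Why would splitting by group help? A relationship that holds within each group can be weakened, or even reversed, once groups with different typical values are pooled together. The sketch below illustrates this with toy numbers (made up for illustration, not penguin measurements) and the `cor` function from Julia's Statistics standard library.

```julia
using Statistics

# Toy numbers, made up for illustration only (not penguin measurements).
x_a, y_a = [1.0, 2.0, 3.0], [5.0, 6.0, 7.0]   # hypothetical group A
x_b, y_b = [4.0, 5.0, 6.0], [1.0, 2.0, 3.0]   # hypothetical group B

# Within each group, y grows with x.
cor_a = cor(x_a, y_a)
cor_b = cor(x_b, y_b)

# Pooling the two groups hides (here even reverses) that relationship.
cor_pooled = cor(vcat(x_a, x_b), vcat(y_a, y_b))

println((cor_a, cor_b, cor_pooled))
```

On the penguin data, the split by species is a single extra mapping: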
specs * mapping(color = :species) |> draw
Ha! Within each species, penguins with a longer bill also have a deeper bill. We can confirm that with a linear regression:
an = linear()
specs * an * mapping(color = :species) |> draw
This unfortunately no longer shows our data! We can use + to plot both things on top of each other:
specs * an * mapping(color = :species) + specs * mapping(color = :species) |> draw
Note that the above expression seems a bit redundant, as we wrote the same thing twice. We can "factor it out" as follows:
specs * (an + mapping()) * mapping(color = :species) |> draw
where mapping() is a neutral multiplicative element. Of course, the above could be refactored as
# Named `layers` rather than `ans`, which the Julia REPL reserves for the last result.
layers = linear() + mapping()
specs * layers * mapping(color = :species) |> draw
We could actually take advantage of the spare mapping() and use it to pass some extra information to the scatter, while still using all the members of each species to compute the linear fit.
layers = linear() + mapping(marker = :sex)
specs * layers * mapping(color = :species) |> draw
This plot is getting a little bit crowded. We could instead analyze female and male penguins in separate subplots.
layers = linear() + mapping(col = :sex)
specs * layers * mapping(color = :species) |> draw
Smooth density plots
An alternative approach to understanding the joint distribution is to consider the joint probability density function (pdf) of the two variables.
using SplitApplyPlot: density
an = density()
specs * an * mapping(col = :species) |> draw